suppressPackageStartupMessages({
  import(rpkgs)
})
import(db)

Deriving features for “market effects”

The target variable in this competition is

\[ R(t) = \texttt{ln}\left(\frac{P(t+16)}{P(t+1)} - M(t)\right) \]

where

Put simply, it’s “(return of the asset) minus (return of the market)”.

Without any features for this, the model would have to infer the market state from the features for a particular asset, makes its job harder.

Instead, lets help the model out by providing some information about the whole market.

Features

windowSize = 42 # minutes
df = getQuery(glue('
WITH
t1 AS (
  SELECT
    ts
  , close - AVG(close)
    OVER (PARTITION BY asset_id ORDER BY ts ASC ROWS {windowSize} PRECEDING)
    AS close_centred
  , STDDEV_SAMP(close)
    OVER (PARTITION BY asset_id ORDER BY ts ASC ROWS {windowSize} PRECEDING)
    AS close_stddev
  FROM trn
),
t2 AS (
  SELECT
    *
  , close_centred / NULLIF(close_stddev, 0)
    AS close_std
  FROM t1
),
t3 AS (
  SELECT
    ts
  , AVG(close_std) AS market_close_std
  , AVG(close_stddev) AS market_close_stddev
  FROM t2
  GROUP BY ts
),
t4 AS (
  SELECT
    *
  , REGR_SLOPE(market_close_stddev, EXTRACT(EPOCH FROM ts)::REAL)
    OVER (ORDER BY ts ASC ROWS {windowSize} PRECEDING)
    AS market_close_stddev_slope
  , LEAD(market_close_std, {windowSize})
    OVER (ORDER BY ts ASC)
    AS lead_market_close_std
  FROM t3
)
SELECT * FROM t4
'))

df
##                           ts market_close_std market_close_stddev
##       1: 2018-01-01 00:01:00               NA                  NA
##       2: 2018-01-01 00:02:00       -0.5303301            2.211614
##       3: 2018-01-01 00:03:00       -0.4922156            3.436598
##       4: 2018-01-01 00:04:00       -0.8753722            4.928739
##       5: 2018-01-01 00:05:00       -1.1223229            6.812861
##      ---                                                         
## 1956778: 2021-09-20 23:56:00        0.4664380           14.281667
## 1956779: 2021-09-20 23:57:00       -0.1911621           13.543799
## 1956780: 2021-09-20 23:58:00       -0.3688219           13.043447
## 1956781: 2021-09-20 23:59:00        1.2012725           12.515743
## 1956782: 2021-09-21 00:00:00        0.9615094           12.247756
##          market_close_stddev_slope lead_market_close_std
##       1:                        NA            -0.8035920
##       2:                        NA            -0.2465318
##       3:                        NA            -0.5903130
##       4:               0.016442450            -0.5045341
##       5:               0.023802300            -0.3217040
##      ---                                                
## 1956778:              -0.004357322                    NA
## 1956779:              -0.004413498                    NA
## 1956780:              -0.004460375                    NA
## 1956781:              -0.004514844                    NA
## 1956782:              -0.004556270                    NA

Time series plot

set.seed(25582)
viewSize = 200
randomDateIdx = sample(viewSize:nrow(df), 1)
randomDateIdxs = (randomDateIdx - viewSize):randomDateIdx
p =
  df[randomDateIdxs,] |>
  melt(id.vars = "ts") |>
  ggplot(aes(ts, value)) +
  geom_line() +
  facet_wrap(~variable, ncol = 1, scales = "free")

ggplotly(p)

Correlation plots

Compare the correlation against market_close_std 42 minutes in the future.

pp = list()
for (f in c("market_close_std", "market_close_stddev", "market_close_stddev_slope")) {
  featureCorrelation = cor(df[,get(f)], df[,lead_market_close_std], use = "complete.obs")
  p =
    df |>
    ggplot(aes_string(f, "lead_market_close_std")) +
    geom_bin_2d() +
    labs(
      title = glue('{f} -> lead_market_close_std (corr: {featureCorrelation})')
    )
  
  pp[[f]] = p
}
ggplotly(pp[["market_close_std"]])
## Warning: Removed 45 rows containing non-finite values (stat_bin2d).
ggplotly(pp[["market_close_stddev"]])
## Warning: Removed 44 rows containing non-finite values (stat_bin2d).
ggplotly(pp[["market_close_stddev_slope"]])
## Warning: Removed 46 rows containing non-finite values (stat_bin2d).

The linear correlations are not very strong, will be interesting to see what the model does with it.